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ECONOMIC COMPARABILITY OF INFORMATION SYSTEMS* 

By Jacob Marschak and Koichi Miyasawa 1 

1. INTRODUCTION 

An information system is a set of potential messages to be received by 
the decision maker. It is characterized by the statistical relation of the 
messages to the payoff-relevant events, and also by the message cost . 2 Neglect- 
ing this cost, the (gross) value of an information system for a given user is 
the (gross) payoff that he would obtain, on the average, if he would respond 
to each message by the most appropriate decision. Thus (gross) information 
value depends not only on the statistical relation between messages and events 
but also on the payoff function. The latter expresses the user's “tastes” and 
“technology.” The ordering of statistically defined information systems by 
their values is therefore at most a partial one. This contrasts with the com- 
plete ordering of information systems (channels) by their equivocation (a sta- 
tistical parameter used in the classical information theory that disregards 
variation of payoff functions from user to user). 

Indeed, if “noise” is defined to increase with equivocation 3 a “noisy” infor- 
mation system may be more valuable to a given user than a noiseless one: 
the betting sports fan may have reason to prefer the sports page of a news- 
paper to its society page even though both pages have the same number of 
English words and the sports page has misprints, the society page none. 

The partial ordering of information systems by their (gross) values will 
be studied. In particular, conditions, sufficient or necessary, will be stated 
under which two systems are comparable, so that one of them is “more in- 

* Manuscript received February 28, 1966. 

1 This work was supported partly by the office of Naval Research under Task 047- 
041 and partly by the Western Management Science Institute under a grant from 
the Ford Foundation. Reproduction in whole or in part is permitted for any purpose 
of the United States Government. During 1964/5, Koichi Miyasawa, Professor of 
Statistics at the University of Tokyo, was post-doctoral Fellow under a Ford Foun- 
dation Grant to the University of California. 

The article is the product of very close collaboration although a few of the sections 
are predominantly the results of the work of one of the two authors (e.g., Sections 
1-7 in the case of Marschak [10, (12-4)]; and Theorem 8.2 in the case of Miyasawa). 
We owe thanks to Dr. Leif Appelgren, Stockholm, for the constructive criticism and 
to Professor Arthur Geoffrion of the Western Management Science Institute for many 
corrections and improvements. 

2 This cost will be assumed additive. We shall thus treat a special case only: see 
Section 4. 

3 See footnote 7 on the textbook terminology regarding “noiselesness” and “equivo- 
cation." 
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formative” than the other in the following sense: one of them can never 
have smaller value than the other for any payoff function defined on a given 
set of events. The ordering of information systems according to their inform- 
ativeness has applications in the economics of information and organization; 
and also, as shown by D. Blackwell, in statistics (where messages and events 
correspond, respectively, to observations and hypotheses). 4 

Sections 1-5 will define concepts useful for the statement of our problem 
of finding conditions for the comparability of information systems. The 
remaining sections seek to solve this problem and to prove inclusion relations 
between the various conditions studied. 

2. EVENTS AND DECISIONS 

For a given actor (decision-maker), we define a set X of states x of the 
environment (not controlled by the actor 5 ), a set C of Consequence c. Each func- 
tion from X to C is called an act by Savage [16]. However, not all acts 
thus defined are feasible. Define the set A consisting of all feasible acts, a. 
These we shall call actions. 6 The set A can be thought of as the resources 
or technology available to the particular actor. Each action a maps X into C 

a(x) = c . 

This is, in general, a many-to-one mapping, for each state x may describe 
the environment in unlimited detail, and some of this detail may be irrelevant 
in the following sense. Two distinct states, x and x' may be such that 
a(x) = a(x') for every a in A. We shall then say that x and x' are equiva- 
lent with respect to A. We can partition X into equivalence sets of the form 

(2.1. A) zi = {x f G X: a(x ') = a(x), all a G A] . 

Denote this partition by Z A . Each equivalence set z A in Z A may be called an 
event, relevant with respect to technology A; or, briefly, an A-relevant event. 

Example. Suppose X is an n-tuple of variables, some of them continuous, 
with n arbitrarily large; the consequences of any action of yours are not 
affected by many of these variables (political situation in Uganda) or by 
minor variations of the remaining ones (seconds of daily sunshine). 

The set of states of the environment can be “coarsened” still further if we 
pay attention, not only to the actor’s technology but also to his tastes. It 
would suffice for this purpose, to replace in (2.1. A) the equality sign between 
a(x f ) and a(x) by some symbol representing the actor’s indifference between 

4 Blackwell [2], Blackwell and Girschick [4], also McGuire [13]. 

5 Thus, in the case of sequential decisions, a state x would describe the time- 
seqence of external conditions. The sequence of states in the language of systems 
theory and dynamic programming corresponds (with the exception of the initial state) 
to a consequence in our terminology. 

6 In an earlier paper (Marschak [11]) the existence and uniqueness of the sets of 
payoff -relevant actions and events was proved, using a set of physically distinct feasible 
actions. We take this opportunity for a correction (due to K. Miyasawa): Theorem 3 
of [11] is not valid, nor is its validity necessary for the proof of the main existence and 
uniqueness theorem. 
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these two consequences. There is no need to use numerical utilities. But 
since the numerical utility concept will be needed soon anyway, denote by u 
the actor’s utility function, a function from C to the set of the reals, and write 

(2.2) w(x, a) = u(a(x)) = u(c ) ; 

the function co from X x A to the reals is called the payoff function. New 
partition X into equivalence sets of the form 

(2.1.ai) z* = {x' € X: w(x', a) = a)(x, a), all aeA}. 

This partition, denoted by Z ", is clearly coarser than Z A (i.e., Z A is a sub- 
partition of Z"). We shall call z ", a typical equivalence set in Z", an 
relevant with respect to the payoff function w ; or briefly, an cu-relevant event. 

Example. Suppose each consequence of any of your actions is a triple 
(c u c 2 , c 3 ) where Ci and c 2 are your profits of this and of the next year, 
measured in cents; and c 3 is the amount of air pollution created by your 
plant. Suppose you are indifferent between c = (c lt c 3 , c 3 ) and c' = (c(, c 3 , c 3 ) 
because you are not concerned with differences less than $1, or because you 
are willing to trade off a part of this year’s profit for a profit increase in 
the next year, or because air pollution results only in other people’s discom- 
fort. Then u(c) = u(c f ); and if, for all actions a in A, a(x) = c and a(x') = c', 
then the states x and x' belong to the same payoff-relevant event z w . 

Considerations of “taste” induce also a coarsening of the set of actions. It 
may happen that, for two distinct actions a and a', and for all x in X , 

u(a(x)) = u{a f {x))\ that is, w(x, a) = co(x, a') . 

Let us, then, define a partition of A into equivalence sets of the form 

(2.3) da = {a' e A: w(x, a') = w(x, a), all x € X} . 

For convenience we shall call the set of co-relevant decisions ; its typical 
element d " will sometimes be denoted briefly as d. 

Example. Let each consequence be a triple (ci, c 2 , c 3 ) defined in the pre- 
vious example; suppose you are indifferent to air pollution and that two methods 
of production, a and a', always yield (i.e., for every x ) the same profits but 
different amounts of air pollution. Then a and a' belong to the same w- 
relevant decision d w . 

Observing that w is constant on z w x d we may write without ambiguity 
ft>(z“, d“) where a> is defined over the domain x D a instead of X x A. 

In what follows we shall be interested in varying the payoff function w 
subject to a constraint depending on an arbitrarily fixed partition Z of X into 
events z. Given a payoff function w , Z is called oj-adequate if it is a sub-parti- 
tion of the aj-relevant partition see Marschak-Radner [12, (chapter 2)]; 
for example, Z A (relevant with respect to technology but not necessary with 
respect to tastes) is ^-adequate. Then every zeZ is a subset of exactly one 
z^eZ" and we can write, without danger of ambiguity, a j(z, d-"), where the 
function a) is defined over the domain Z X ZK Now let Q z be the set of all 
payoff functions co for which Z is o»-adequate. Given a fixed set Z of events 
we shall vary the payoff function co over the set Q z - 
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It has been shown— see Savage [16]— that certain plausible, quasi-logical 
postulates imply for a consistent decision-maker: (a) the existence of a 
numerical function u on the set C of consequences; hence a numerical func- 
tion a* on Z x D*; and (b) the existence of a probability measure P on X\ 
hence of a probability function p(z) on Z, with the following property: given 
two decisions d and d' in D w , a consistent actor will not prefer d' to d if 
(assuming Z finite for simplicity) 

(2.4) X co(z t d)p(z) ^ 2 v(Zf d f )p{z ) . 

Z 6 Z Z 6 Z 

The two compared averages are called the (expected) utilities of the decisions 
d and d', respectively. The proposition just stated is called the expected 
utility theorem. Roughly, it says that an actor conforming with certain 
consistency postulates (and with the rules of ordinary logic) maximizes the 
expected utility of decision. 

We stated above that the set A = {a} of actions (i.e., feasible acts) represents 
the actor's technology, and that the utility function u on the set C represents 
his tastes; hence the payoff function co(x, a) s u(a(x)) reflects both. We can 
now add his beliefs, represented by the probability measure P on X. In 
what follows we shall consider P and Z fixed; and we shall permit to to vary 
over the set Q z for which Z is w-adequate. This will enable us to discuss 
the “informativeness” of information systems for an arbitrary set of then- 
users: see Example in Section 5. 


3. INFORMATION SYSTEMS 


An information system Y is a set consisting of (potential) messages y. We 
shall regard Y as another partition of the set X of states x. Unlike parti- 
tion Z of X into payoff-adequate events 2 , the partition 7 of I into messages 
y is not associated with the feasibility of actions and the indifference among 
their results. Instead. Y is associated with some object— an instrument— 
that produces messages. See Figures la-lc. Example : The state x may be 
such that (a) my barometer shows low pressure: this message y ; and that (b) 
the visibility at the airport is low, thus affecting the success of a decision 
to fly: this is event z. 

In the language of information theory, a set Z of events would be called 
“source” and a set Y of messages, a “channel.” In the language of statistical 
inference, Z represents a set of alternative hypotheses, and Y represents 
the set of outcomes of an experiment and is itself called an experiment. 

If a probability measure P is defined on X, the joint probability function 
on Z x Y is determined. In fact, given a set (Y, Y\ Y", • • •) of available in- 
formation systems, the multivariate distribution on Z x Y x Y' x Y" X • • • 
is defined. We shall write, using the same symbol pi ) for probability func- 
tions over different domains, yet without risk of ambiguity: 


probability of event, 
probability of message, 


Pr(# ez) = p(z ) ; 
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Set X 

Zx Y Zx Y' 

©© 

le If 

Y' 

Q 

li 

Noiseless Cases ( ZsY , ZsY'), with YsY' 

Figure 1 

SET X AND ITS PARTITIONS 

Pr(a; e y) = p(y ) , 

(p(z) and p(y) are all positive since z and y are non-empty); 
joint probability of event and message, 

Pr(a e z n y) = p' K z, y) ; 

posterior probability of event, given the message, 

y)lp(y) = p(z I y) ; 

likelihood of message, given the event, 

p(z, y)lp(z) = p(y I z) . 

When comparing two information systems Y and Y f we shall also use 
expressions like the following: 

probability of joint occurrence of message yeY and y'eY', 

Pr(a eyf) y')= p(y, y') ; 


Z 



lb lc Id 


Partition of X 
Z Y 
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probability of joint occurrence of event z and messages y, y\ 

Pr(a e z n y n y') = p(z, y , y ') ; 
probability of y', given y, 

p(y, y')lv(y) = p(y' I y ) ; 

posterior probability of event z, given messages y, y f , 

p(z, y> y')lp(y, y') = p(z I y, y ’) ; 

and so on. 

For simplicity of reasoning, we have assumed the set Z of events and the 
sets Y, Y' f • •• , of messages finite. (No such assumption will be made about 
X , the set of states, except in Theorem 11.4.) Specifically, 

Z = (z lf • • • , z m ); Y = (y u • • * , y n ); Y' = (y[, • • • , yiO; m, w, w', • ■ • S 2 . 


The generic elements z of Z,y of F, y’ of F', • • *, can be regarded as random 
variables taking, respectively, the values 

Z;(i = 1, • • •, m); t/Xi = 1, • • •, w); y*(fc = 1, • • *, n'); 

In most of our discussion Z will be fixed, and the effects of varying Y (i.e., 
of replacing it by Y', say) will be studied— see Figures lb-lf— using for the 
three marginal probabilities the following vectors (the various alternative 
notations will be used according to convenience): 


the (m X l)-vector [p(z)] = [p(Zi)] = [r z ] = [n] = r 

the (n X l)-vector [p(y)\ = [p(yj)\ = [qj] s q Y = q 

the in’ X l)-vector [p(y f )\ = [p(y'k\ = [ql] = <? r ' = q r . 

For the joint probabilities of events and messages; the posterior probabili- 
ties of events (given the message in Y); and the likelihoods of those messages 
(given the events in Z), we use the (m X n)-matrices 


[p(z, y)\ = [pith yj)\ = [Pzy\ = [pu] = P r = P 

[p(z ! y)] = [p(*i I yj)] = [n zy] = [nu] = n Y s n 

[p(y I Z)] = [p(yj | Zi)\ = [X 2y ] = [lij] = A Y = A 

and the corresponding (m x w')-matrices for Y' 

P Y ' = P'; /7 r ' = 77'; A Y * = A' . 

By definition (writing henceforth S> S for summation over sets y, Z) 

V Z 


(3.1.1) 

(3.1.2) 

(3.1.3) 


Si Pzy — qy >0, Si Pzy — ?*z X’ 0, TZ z y = 0, == 0 

z y 

Pzy — ^zi/^z — Kzyqy = 0 

1 — Si Q'y — S r, = Si — S ^zy — S S Pzy • 

2/ z y z y z 


Moreover, with Z fixed, r is fixed and we have for any Y, Y' 


(3.2) 


Flq = r = Fl'q' . 
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We shall also use the ( n X w')-matrix 

(3.3) [p(y' | y)l = [p{y' k | y s )] ^ [ 7 „.] s [ Tjk ] = JH . 

Clearly, 

(3.4) TV#' 0; = 1 j 

V' 

(3.5) q' = r {T) q , 

where r (T) is the transpose of 7\ Similarly, we define the (w' x n)-matrix 
(3.3') [p( 2 / 1 2 /')] ^ LKyj I y' k )] ^ [TyA = Md = , 

(3.4') ri»' SO; S rir = 1 . 

(3.50 q = ry . 

Let T and V be two partitions of X. T is said to be a subpartition 
of (or finer than) V (or, equivalently, V is coarser than T), 

(3.6) T s V 

if each t in T is contained in one of the V in T'; or equivalently, there is 
a many-to-one correspondence T—>T'. Condition (3.6) implies— and, in case 
of X finite, is equivalent to— the following: for any teT,t f e T , 


(3.7) 



if tdt' 
otherwise . 


In particular, if the set Z of events is finer than the information system 
Y f (see Figures lg, lh) 


(3.8) 


Zs7, 


we shall say that Y is noiseless (with respect to Z). In this case there is a 
many-to-one correspondence Z-~>Y; and it will follow that 


(3.9) 


v(y I z) s z zy = 


{o 1 


if zoy 
otherwise . 


Thus each row of A Y consists of one 1 and n — 1 zeros if Y is noiseless. 7 
In Section 11, we shall consider the case when the two compared informa- 


7 This agrees with the terminology of Feinstein [7, (23)]; but Abramson [1, (11-2)] 
calls the case (3.10) a “deterministic channel,” and reserves the adjective “noiseless” 
for the case when the correspondence Z -> Y is one-to-many, hence in our notation, 


p{Z | y) = llzy 



if y c z 
otherwise , 


so that each column of 77 r consists of 1 one and m — 1 zeros. In this case, however, 
it should be easy (i.e., costless) for the user to consider all messages corresponding to 
the same event, as one and the same message; this establishes a one-to-one correspond- 
ence between the set of messages (thus redefined) and the set of events. This is some- 
times called perfect information. Then 77 and A are the same (permutation) matrix: 

, _ fl if 2 = y 

- Kzy - \ Q otherwjse _ 


It seems to us therefore that the case called “noiseless” by Abramson is of little in- 
terest: it collapses into the case of perfect information which is a specical case of 
noiseless information, in Feinstein’s and our sense. 
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tion systems, Y and Y' are both noiseless (see Figures lg, lh, li): 

Condition ( N ): Z s Y, Z s Y'. 

If two information systems Y and Y' (noiseless or not) are such that Y 
is finer than Y' f and thus Y' coarser than Y we write 
Condition (C): Y s Y f . 

In this case we can also say that Y' is obtained from Y by collapsing , or 
condensing, several messages in Y into a single message in Y' . Under 
condition (C) there is a many-to-one correspondence Y— >Y', and (C) will 
imply, as in (3.7) that 

fl if y ci/' 

(3.10) P(y'\y) = ryy' = L J 

10 otherwise . 

4. INFORMATION VALUES: GROSS AND NET 

The actor associates each message in Y with some decision in D w . This 
mapping will be called a decision rule, an element of a set Without 
danger of ambiguity we shall often omit the superscript co. Thus 

d = d(y) , 

(4.1) o)(z, d) = o)(z, d(y)); zeZ,yeY,deD,deA. 

Thus, given the payoff function co, the utility of the result depends on the 
event z, the message y and the decision rule o. 

The utility amount co(z, 5(y)) may be interpreted by expressing the economic 
effect (i.e., the effect upon the utility to the decision-maker), not only of 
decision d (y), given event z, but also of the decision rule d itself. For ex- 
ample, a simple (e.g., linear) decision rule is less costly to apply than a 
complicated one. Moreover, a decision rule from Y to D, where D consists 
of feasible decisions (actions), may itself be non-feasible— for example, if it 
is so complicated as to exceed the decision-making capabilities of available 
men or machines. Strictly speaking, we should define a feasible subset A v 
of A, for use in any further economic discussion of information systems. 
For the sake of simplicity we shall neglect in most of the following both 
the cost and the possible non-feasibility of decision rules and assume A<p = J. 

An optimal decision rule o* in A maximizes the utility averaged over all 
messages y in Y. The maximum expected utility thus achieved will be 
denoted by U(P r \ co) for it will depend on the joint probability P v of y and 
z (with Z fixed) and on the payoff function co: 

(4.2) U(Pr; «0 s £ S o*(y)) Pzy 

v z 

= S S w( ' z ’ 5 (y))vzv 

V z 

for all d in J. Then, since p zy = q y n zy , we have 

(4.3) U(P r ; co) = max ^ S d(y)) 

SeJ y z 

= y, <7v max V n zy w(z, d ) 

y deD z 

= Qv KzyW&t dy) — 2 S VzyO){z, d y ) , 

V Z y Z 


This content downloaded from 128.197.26.12 on Mon, 27 Jun 2016 09:07:40 UTC 
All use subject to http://about.jstor.org/terms 



INFORMATION SYSTEMS 


145 


thus defining d y = d*(y) as the optimal decision in response to message y: 

(4.4) 2 ^ y (o(z f d y ) ^ 2 XzyMfc* d) y all deD . 

z z 

In order to guarantee the achievement of a maximum in (4.3), (4.4), we shall 
henceforth assume that the set of real numbers cu(z, d), deD”, is bounded 
from above and closed and that w is continuous. 

A more appropriate and precise name for the quantity U(P Y ; <o) would be 
gross value of Y, in contrast to its net value. So far, we have neglected 
those economic effects of an information instrument that are due not to 
decision based on messages received through the instrument but to its other 
properties, roughly called “cost of information.” For example, one of two 
compared instruments may be more expensive, or it may send messages that 
while related to events by the same probability distribution are more time- 
consuming. Strictly speaking, these two instruments produce two different 
partitions of X : for the occurrence of a particular message belongs to the 
detailed description of a state x ; and the consequence may depend, not only on 
decision d and event z but also on the manner in which the message received 
through a particular instrument affects the user's resources and well-being, 
through the expenditure of time, money and effort involved in the process 
of receiving the message. Equation (3.2) should be replaced by a more 
general one 

(4.5) u(c) = u(d(z), y) = co Y (z, d, y) 

say, where y belongs to a set Y of messages associated with a particular 
instrument. The new function w Y may be called the net payoff function. 
In a simple but important case 

(4.6) a> r (z, d, y) = w(z, d) — My) , 

where, for example, Mv) is the monetary cost associated with message y , 
o)(z, d) is the (“gross”) monetary profit from decision d when event z occurs, 
and utility is linear in money: see Raiffa and Schlaifer [15, (section 4)]. In 
general the net value of an information system Y can be written as 

(4.7) Uy{Y) = max £ S ^r(z, %y), v)pzy 

dej 2 y 

so that in the special, additive case (4.6) we have, 

(4.8) Ur(Y) = U{P Y \ w) - £ My)p(y) . 

y 

In what follows we shall indeed assume (4.6), hence (4.8). This will permit 
us to study the economic effect of statistical properties of an information 
system, separated from the system's cost properties. But remember that in 
the general case (4.5), (4.7), the net value of an information system cannot 
be decomposed into gross payoff and cost as additive components. 

5. COMPARISON OF (GROSS) INFORMATION VALUES 

We shall compare the (gross) values of two information systems, Y and 
Y', having fixed the set Z of events and thus the vector r = [r 2 ] of marginal 
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probabilities of events and the set Q z of considered payoff functions. 

Example. To expand further the flight passenger's case of Section 2: We 
can ask whether a barometer or a hygrometer (with messages referring to 
humidity) is preferable to a user who has a ranch, an airplane, or both. 
The uses considered will define the set Z of payoff -adequate weather descrip- 
tions which will permit the necessary comparison. 

In the (“Bayesian”) spirit of the expected utility theorem, the two joint 
distributions P r and P Y ' (also denoted as P and P') are known to the deci- 
sion maker. He knows therefore also the pairs 

(/7*, q *) and (77*, <?*) , 

(A*, r*) and (A*, r*) , 

indicating by the star subscript (as it will be sometimes useful to do) that 
those matrices and vectors are known. Such knowledge is usually assumed 
in operations research problems— including the most general ones of sequential 
decision making, in which one starts with an initial distribution of states 
seen as time sequences and conditional upon the sequence of decisions: see 
Miyasawa [14]. The knowledge of r* is also often assumed in information 
theory: for each channel Y, the joint distribution p(z, y) is known, permitting 
one to compute, for a given source Z, the equivocation characterizing the 
channel Y s 

(5.1) H(Z | Y ) = - £ p(y) £ p{z | y) log p(z | y) . 

y z 

(Compare this with (4.3)!) 

Also, U(P Y ; w) and U{P Y '\ co ) can sometimes be ordered, for all oj considered 
without the knowledge of the distributions P Y aed P Y ’ and only knowing 
the process that produces Y and Y' and imposes certain general restrictions 
on the trivariate distribution of z, y, y ' . Such will be, in fact, the content 
of Sections 6 and 7. 

It is also useful to consider where the comparison between information 
values of Y and Y' is made by someone who uses the knowledge of 77 = 77* 
and 77' — 77* (ignoring or being ignorant of q, q f ); or who uses the knowledge 
of A = A* and A' = A* (ignoring or being ignorant of the value of r). Such 
cases of limited knowledge will be considered because they may simplify the 
procedure of comparing information values. In addition, an actor may have 
knowledge of posterior probabilities of x zy associated with each message y , but 
have trouble ascertaining the probability q y of each message: this makes the 
comparison of 77 with 77' interesting. (Note, however, that the probability vectors 
q,q f ,r even though ignored or not known are constrained by identity (3.2).) 

8 The equivocation H{Z \ Y) = 0 if and only if all p(z | y) = K zy — 1 or 0, i.e., if 
and only if Y is noiseless in the sense of Abramson and not in the sense used by 
Feinstein and ourselves. On the other hand, Y is noiseless in the latter sense, i.e., 
all p(y j z) = X zy = 1 or 0, if and only if the expression 

77( Y\ Z) = — Z Piz) 2 P(y I z) log p(y\z) = 0 . 

^ y 

Following Abramson, the latter expression might be called equivocation of Y with 
respect to Z . 
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Finally the ignorance of the (prior) distribution of hypotheses is maintained 
in non-Bayesian statistics, making it impossible to use other than the likeli- 
hood matrices A — A* and A' = A*. 

Whenever P Y , P Y ' are known (hence r, A , A', 77, 77' have known values r*, 
A *, etc.) and are such that 

(5.2) U(P Y ; w) ^ 7/(P F '; a>) , all in Q z , 

we shall write 

Condition (P): P Y > P Y ' (or briefly: P > P'), 

and say: “Y is more informative 9 than Y f as revealed by their joint distribu- 
tions P, P' with Z” or simply “P is more informative than P'” Note that 

(5.2) can be rewritten thus 

(5.2.1) 7/(77*, q *; <w) ^ 77(77*, <?*; a>), all to in .0*; or (77*, <?*) > (77*, g'*) , 
or 

(5.2.2) U(A* f r*; ©) ^ U(A*, r *; a>), all a> in or (A*, r*) > Ui, r*) . 

If r is ignored or not known, we may ask whether 

(5.3) U(A*, r; co) ^ T/Gf*, r; <w) , all oj in and all r ; 

and, if so, write (omitting the stars for brevity) 

Condition (A): A > A' 

and say: U Y is more informative than Y' as revealed by the likelihood 
matrices A and A f ” or simply: “A is more informative than A'” 

Finally, it may also be of interest to ask whether or not 

(5.4) 7/(77*, q; w) ^ 7/(77*, q'\ co) for all oj in Q z and all q,q' 

such that 77*g = 77*g' . 

This question arises naturally when the marginal probabilities q,q f ,r are 
ignored or not known. Since Z, hence r, is the same for both information 
systems, lT*q = 77*g' by (3.2). When (5.4) is satisfied we shall say (omitting 
stars for brevity) that “Y is more informative than Y' as revealed by the 
posterior probabilities 77 , 77'”; or briefly <e 77 is more informative than 77',” 
and write 

Condition (77): 77 > 77'. 

The binary relation “ > ” defined above on the set of all matrices P, or all 
matrices 77, or all matrices A, as the case may be, induces in all cases at 
most a partial ordering on the P (or the 77 or the A), hence on any set of 
available information systems. For, in general, there will be some pairs 
(P, P') or (77 , 77') or (A, A') for which there exist two payoff functions, co and 
a/ in Q z , such that the inequality in (5.2) or (5.3) or (5.4) is valid for oj but 
not for uj\ In contrast, the equivocation (5.1) used in information theory 
does not depend on the payoff function and thus induces complete (and weak) 

9 Strictly speaking, the term “not less informative” would be more appropriate. 
We have chosen terms that will be shown to harmonize with those of Blackwell: see 
Section 13. 
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ordering. It will be shown that lower equivocation is necessary but not 
sufficient for higher informativeness: see Section 12. 

Since q, q' and r are all fixed (at starred values) in condition (P) but not 
in condition (A) nor in condition (FI), condition (P) is weaker than either 
of the other two, and we have 

Theorem 5.1. (A) => (P); (77) => (P). 

We shall now prove that (A) and (P) are equivalent. 

Theorem 5.2. (P) «=* (A). 

Proof. By Theorem 5.1, it will be enough to prove (P)=>(A). Let 
P y = (A, r), P r ' = (A', r ) and assume that, as in (5.2.2) but with stars omitted 
for brevity, condition (P) holds. From (3.1.2) and (4.3), for any payoff func- 
tion (jjeQz from D w X Z to the reals, 

U(P y ; w) = U(A, r; w) 

^ max Y, r z lzyoy(z , d) . 

y d s D<° z 

For any given r — r define a new payoff function u) in Q z by 

(5.6) *>(z,d) = —a)(z, d); D* — D” . 

Then, replacing oj by u3 in (5.5), 

U{P r \ cd) — 2 max S rzAzyd)(Z, d) 

(5.7) y deD<u 2 

= U(A, r; oj) . 

By similar reasoning, 

(5.70 U(P yf ; aj) = U(A',r;a>). 

From condition (P), we know 

(5.8) U(P y ; id) ^ U(P y '; cd ) . 

Therefore by (5.7), (5.70 a nd (5.8), 

(5.9) U(A, r; w) ^ U(A f , F; co) , 

where r and oj are arbitrary, so that (5.3) is satisfied. Thus (P) implies (A). 

It follows from Theorem 5.2 that, if one system is more informative than 
another as revealed by their respective likelihood matrices, no further know- 
ledge is added by knowing the prior probabilities of events. On the other 
hand, we shall show in Section 9, Theorem 9.3 that (P) implies (II) only under 
certain conditions. Anticipating this result we state a summarizing 

Theorem 5.3. (77) (P) (A), 

using here and henceforth the symbol for “implies but is not implied by. ,, 

We conclude this section by remarking that it is not possible to apply the 
relation “more informative than’ 7 to net rather than gross information values. 
Modifying the notation of (4.7) in an obvious way, suppose that the net payoff 
functions w r and wy, defined respectively on Z X D X Y and Z x D X Y ! are 
such that for some positive number k 
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(5.10) k > Uy(Y; o> Y ) - Uy(Y'; «>*.) ^ 0 ; 

then there exists another pair of net payoff functions, a) V and oj Y , f such that 

Uy(Y ; cDy) - Ur,(Y w Y ,) < 0 : 

for example, let 

d, y) - o)y(z, d,y) — | -fc ; coy (z, d, y') = (z, d, t/') + . 

In particular, when the net payoffs can be decomposed into gross payoffs 
and information costs as in (4.6), it is always possible to imagine informa- 
tion cost functions K(y), / z\y ') that would reverse the second inequality in 

(5.10) . 


6. GARBLED INFORMATION 

For any joint distribution of three variables z, y, y’ the following identities 
hold: 

(6.i) p(yMy,g) _ v(z\v,y r ) g v(y,y'\z) 

v(y' I y) viz | y) viy I z) • viy’ I y) 

since each of these three ratios is equal to 

p(z,y,y r )-y(y) 

p(z,y)-p{y,y ! ) 

We shall say that, for a given set Z of events, the information system Y 
in garbled into Y' when, for all z e Z, y 6 Y, y' e Y the following condition 
holds: 

Condition (G): Each of the three ratios in (6.1) is equal to 1. 

(G) can therefore be written in three equivalent forms. Each seems to agree 
with the ordinary usage of the term garbling. 10 

(Gi) p(y' I y,z)= viy' I y) , 

i.e., given message y of the original information system Y, the conditional 
probability of message y' of the garbled system Y' does not depend on event z . 

(62) viz | y, y') = viz | y ) , 

i.e., the posterior probability of event z given the original message y, does 
not depend on the garbled message y'. 

(G s ) p(y, y'\z)= p(y \z)-p(y'\y ) : 

this describes the generation of the pair of messages, y and y f : while in 
general p(y, y' \ z) = p(y \z) • p(y' | y, z), this identity becomes in the garbling 
case, (G 3 ), because of (Gi). 

The following condition that is implied, but does not imply 11 condition (G), 
is obtained by summing (G 3 ) over y 

10 This is called “cascade” in information theory: e.g., Abramson [1, (section 5.9)]. 

11 A rigorous proof will be given in Section 9. 
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p(v' I *) = P ( y> y’ I z ) = 2L P(v I z)p(v' I y) , all z e Z, y> G Y' ; 

y y 

or, in the notation of Section 2, 

( 6 . 2 ) = ^jAzyYyyl • 

V 

Using matrix notation we rewrite this as 

Condition (7 1 ): = ylT , 

and obtain 


Theorem 6.1. (G)h=>(r) . 

A row-stochastic, or Markov, matrix is one with non-negative elements only, 
and with all row-sums=l. Denote the class of all Markov matrices of order 
n X n' by , and the class of their transposes, i.e., of all “column-stochas- 
tic” matrices of order n' x n, by -^ r >. Then obviously 

(6.3) r e r e . 


In honor of David Blackwell, we have called the following condition 
Condition (B): There exists a matrix B = [&■&] such that 


(6.4) 


(Bo) B e 

(BO A' = AB. 


Condition (B) is implied by the garbling condition (G), but not conversely; 
for we can prove 


Theorem 6.2. (G) (7 1 ) ( B ) . 

Proof. By (6.3), (T) implies (B). The converse is not necessarily true (see 
Section 9). Hence our theorem follows from Theorem 6.1. We shall now 
prove the important 

Theorem 6.3. (B) => (A ) . 


PROOF. For any (A, r) = (/7, q) and define for every message yj the 
optimal decision dj\ that is, by (4.4), 


(6.5) 2 XiMtit dj) ^ 2 mMPit d) , all d e D w ; 

i=l t=l 

or, omitting the summation limits for brevity and writing 

Co(Zi t dj) = V/{j , 

2 ^ 2j d ) , all d € , 

i i 

^ 'i TiAijUij = ^ i ViAij(l)(Zi, (Z) , 3-11 G f 

i i 

Then by (4.3) 

C7(P; ») s I7(^, r; ®) a U(II, q; ®) 

= S tUKijUi, = 2 rJijUij . 
itj i,i 

Similarly, for ( A',r ) = (77', <?')> we define d' k e D w and u[ k = <o(Zi, d k ) so that 


( 6 . 6 ) 

(6.7) 

by (3.1.2). 

( 6 . 8 ) 


This content downloaded from 128.197.26.12 on Mon, 27 Jun 2016 09:07:40 UTC 
All use subject to http://about.jstor.org/terms 



INFORMATION SYSTEMS 


151 


(6.6') ^ n'ikUik ^ n'uMZu d) > all d e Z) w , 

» * 

(6.7') £ r.vi.'tMu ^ £ «■<&»(«<, d) , all d e D" , 

< * 

and we have 

f/(P'; «) = r; <u) = 17(/7', q'\ m) 

(6 ’ 8,) = £ = £ rd'ikUik . 

♦ .* i. /c 

Suppose (B) holds. Then by (6.8'), (6.7), (6.8) we have, for any r, w, 

U(A; T] (o) = @jk y, TiXaUik 

J,k i 

^ 2 @ jk S Ti^ijUi j = 2 i^ijUij 
j,k i i,j 

= C7(^i, r; cu) . 

This proves that (B) implies (A). In Section 8 we shall prove the converse, 
so that (B) «=* (A). 

From Theorems 6.1, 6.2 and 6.3 we immediately obtain 

Theorem 6.4. If Y is garbled into Y f then Y is more informative than 
Y' as revealed by their likelihood matrices A and A 
The following theorem summarizes the inclusion relations stated in this 
and the preceding section: 

Theorem 6.5. (G) (7 1 ) (B) «=* (A) ~ (P) *=* (77) . 

The proof of this theorem will be complete when, as mentioned before, it 
is shown that (A) implies (B) (Section 8) and that (B) does not imply (0 nor 
does (T) imply (G) (Section 9). Pending these proofs we have proved so far 
only the weaker 

Theorem 6.6. (G) => (JT) => (B) => (A) ~ (P ) . 

7. COLLAPSING INFORMATION AND JOINING INFORMATION 

The case in which Y f is condensed from Y was defined at the end of 
Section 3 as 

(C) YsY' , 

i.e., each y in Y is contained in one y' in Y f . Hence by (2.7) 

either y<^y', p(y' I y) = 1, p(y, y'\z) = p(y | z) 

or y<£y', p(y' I y) = 0, p(y, y'\z) = 0. 

In either case, (GO is satisfied. Hence (C) implies (G); but the converse is 

obviously not true. Collapsing information is thus a special case of garbl- 
ing. We state this as 

Theorem 7.1. (C)n-(G). 

Let every message in Y be obtained by joining a message in Y’ with a 
message belonging to a third information system, T. Then, 12 since Y, Y', T 

12 See, e.g., Halmos [8], 


This content downloaded from 128.197.26.12 on Mon, 27 Jun 2016 09:07:40 UTC 
All use subject to http://about.jstor.org/terms 



152 


J. MARSCHAK AND K. MIYASAWA 


are partitions of X, 

Condition (J): Y —Y' x T = {{y’ n t)}, y r eY',te T . 

Clearly Y is finer than Y f , so that (J) implies (C); but the converse is, of 
course, not true, and we have 

Theorem 7.2. (J)i=>(C). 

We can combine these results with those of Sections 5 and 6 into 

Theorem 7.3. (J) n* (C) h* (G) (P) (B) ==> (A) <=> (P). 

Thus Y' is less informative than Y when Y f is garbled or condensed from 
Y ; or when Y is formed by joining Y ! with a message from a third infor- 
mation system. Garbling and condensing information occurs when an inter- 
mediate node is inserted in a communication network— as, for example, when 
reports are processed into commands. Joining information occurs when in- 
formation is centralized. It follows that inserting an intermediate node 
never increases, and centralizing information never decreases, the gross 
expected payoff. But the net payoffs may be ordered differently. For ex- 
ample, the cost of centralizing information may offset its advantages (see 
remarks at the end of Section 5) or such centralization may call for rules 
that are not feasible (cf. Section 4). 

8. EQUIVALENCE OF CONDITIONS (B), (A), (P) AND (0) 

Theorem 8.1. (B) *=> (P) <=> (A). 

Proof. Because of Theorem 5.2 and 6.3, it will suffice to prove (P)=>(B). 
We shall first prove the following: 

Lemma. If (P) holds, then for any real in' X m)-matrix V — [v*,-] there 
exists a matrix M = [m k j\ e ~ /Y, z] such that 

( 8 . 1 ) Y' i 'WlkjV' i^ijVik ^ Ti^ki^ki • 

i,k,j i,k 

To prove the lemma let us define, for a given matrix V = [Vki] the payoff 
function a o as follows: D" — {d u • • *, d 7l /}, and 

(8.2) co(z h dk) ~ Vki , i = 1, • • • , m; k = 1, • • • , n' . 

Now define, for each message yj in Y, the optimal decision dk^^D” by 
(^•^) ✓v dkU)) = i Jc — 1, • • n 1 \ j — 1» * * *» n \ 

i i 

then by (4.3'), (2.1. 2) 

(8.41 U(P Y -, w) = 2 TikiVkun . 

i,3 

Now, for any matrix T — [t ki ]e we have by (8.3) 

(8.5) y 1 i = V , T {X = y, t'kjTiXijVki t 
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since t kj ^ 0, 2 fos — 1- That is 

k 

( 8 . 6 ) U(P Y ; co) ^ 2 tkrtlijvu . 

i,j ,k 

Now by definition, 

(8.7) U(P y ‘; *>) ^ £ rJiiMzi, d k ) , all d k e £>“, i.e., 

i.k 

( 8 . 8 ) U(P y ';w) Xnl'itVu . 

i,k 

If condition (P) holds then, 

(8.9) U(P Y ; w) ^ U(P Y> ; co ) . 

Let our matrix T be such that 

fl if k = k(j) 

0 k j | 

10 otherwise. 

Then (8.6) becomes an equality. Therefore by (8.8) and (8.9) T has the 
properties of the matrix M required in the lemma. 

To complete the proof of the theorem, let 5^ be the set of all in’ X m)- 
matrices V = [v* t -] such that 0 ^ Vk% ^ 1; as before, let i¥ = [ m k j ] belong to 
and define a function ¥{M, V ) on ,^C r) X 5^ by 

(8.10) ¥{M, V) = V rrikjrdijVki — 2L • 

i,k 

Then f is a bilinear function of Af and F; and the factors ^C r) and 5^* of 
its domain are both closed, bounded and convex sets in (w' X n)- and (n ! x m)- 
spaces, respectively. Therefore, by a saddle point theorem— see, e.g., Karlin 
[9, (II, theorem 1.51)], there exists a pair [mlj = M° G ^ T ) and F 0 G^ 
such that 

(8.11) r(i¥, F°) ^ F°) ^ ¥{M\ V ) , 

for all 16.4 and Fg?^. 

Suppose (P) holds. Applying our lemma to F= F°, we see that there 
exists a matrix AfG^C r) such that ¥(M, F°) ^ 0. Therefore by (8.11) 
¥(M°, F°) ^ 0 and thus 

(8.12) ¥(M°, F) ^ 0 , all Fe . 

Define V iki) g ^ as a matrix whose ( k , z)-th element is 1 and all other 
elements are 0. Then by (8.10), (8.12) 

¥(M\ V' ki) ) = V, mljrfa - r,/ a 2? 0 , 

3 

and since all n > 0 by (3.1. 1\ we have 

(8.13) 2 m h*ij = , i = 1, • • • , m; k = 1, • • • , n' . 

i 

If in (8.13) at least one of the inequalities is strict, we obtain a contradiction 

(8.14) S mljhj = £ = »» > = m > 


This content downloaded from 128.197.26.12 on Mon, 27 Jun 2016 09:07:40 UTC 
All use subject to http://about.jstor.org/terms 



154 


J. MARSCHAK AND K. MIYASAWA 


since £ ml,- = £ - £ } ' ik = 1 - Therefore 

k j k 

(8.15) 'flikjAij ~~ Aik , i = 1» * • * i wij k — 1, ■ i t i.6.j 

3 

(8.16) A> = AM ( ° r) 

where M{ V) is the transposed matrix of M°, hence 

(8.17) MU e . 

Now putP = M ( ° r) . Then J? fulfills the conditions (6.4). This proves that 
(P)=»(B) and completes the proof of our theorem. 

We shall now show that the following condition (which will prove useful 
in Section 9) is equivalent to (B) and thus to (P) and (A): 

Condition (0): There exists a matrix 6 = [0jk] such that 

(00 )l 0 € w^(r) 

(00: 77' = lid and 
(00: q = 0q' . 

Note that (0) is a property of the joint distributions P, P', while (B) is a 
property of the likelihoods A, A' only. 

Theorem 8.2. (B)~(d). 

Proof. By definition (see Section 3), 

(8.19) qj = 2 rdu > q* = S r ^ ik > 0 

i t 

( 8 . 20 ) qjTZij = = 0 , qkT^ik ~ ^i^ik = 0 . 

Suppose (B) is true: that is, there exists B = [£;*] such that 

(8.21) = ^ o, £j9;* = l. 

j k 

Then by (8.19) 

( 8 . 22 ) qk — ^ 'PiAijfijh = qj0jk • 

i,j 3 

Given the matrix P define a matrix [0j/c] s 0 by the following equations: 

(8.23) 0jkqk = fijkqj > 2.11 3i k . 

It is easily verified that 0 is in (using (8.22)); that q = 0q f (using (8.4) 

and (8.23)); and that 77' = 77 0 (using (8.20), (8.21), (8.23)). Hence (B) => (0). 
To prove the converse, note that condition (00 implies, by (8.20) 

I ViAik v-" 1 _ n n 

7Zik — / — y, KijUjk — " 5k t 

ql j 5 qj 

(8.24) *» = «»£— 

j q 5 

Define B = [&•*] by (8.23). Then by (00 

£ Pit = — £ OjkQk = 3*- = l; ftjic > 0 , 

k qj k qj 
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hence Be ^ ; moreover by (00 and (8.23), A f = AB. Thus B satisfies both con- 
ditions in (6.4), and we have proved that (0)=>(B). This completes the proof. 

Combining Theorems 8.1 and 8.2, we have 

Theorem 8.3. (P) ~ (A) <=> (B) ~ (0). 

Using the matrix P' = [p(y | y')\ defined in (3.3') we introduce 

Condition (P'): 77' = 77P' . 

Then we have 

Theorem 8.4. (G) *=> (P') =* (0). 

Proof. Consider the identities 

p(y, z\y') = p(y, y', z)lp(y') = p(* I y, y')-p(y, y')lp(y') 
p(y, 2 11/0 = v(z | y, y')-p(y I y') ; 
summing over y 

p(z 1 2/0 = s v( z I y> y r )‘viy I i/O • 

V 

This becomes, if (G 2 ) holds, 

p(z l y') = 2 p( z I y)-v(y I v ') 

v 

77' = nr . 

This proves (G) =» (P). It is clear that if (r ; ) holds, then (0) is satisfied with 
0 = P'. This completes the proof. 

If a matrix P, and therefore also matrix 0, exists, it is possible to inter- 
pret the information systems Y and Y' as if Y were garbled into Y r , 
although it is not known whether or not the trivariate distribution on 
Z X Y X Y' satisfies condition (G). 

Example. As in Section 2, let z be the visibility at the airport. Let y be 
the true atmospheric pressure, and y r be the reading on a barometer. Then 
presumably condition (G) is satisfied, and by the proof of Theorems 6.5 and 
8.4, conditions (B) and (0) are satisfied by matrices B ~ P and 0 = P' (and 
possibly also other matrices). If, on the other hand, we interpret y r , not as 
a reading on a barometer, but as, say, the true level of humidity, then we 
have no reason to suppose that Y is garbled into Y r (or conversely). If the 
bivariate distributions onZx7 and Z X Y', respectively, happen to satisfy 
condition (B), hence (0), it may be useful to interpret the matrix 0 as if it 
were identical with P' (and B with P). That is, we can conceive of the mes- 
sages yj as follows: given yl, the message yj will be produced, using a random 
device, with probability p(yj | yl) — Ojk, j = 1, • • *, n. 

9. COMPARATIVE INFORMATIVENESS REVEALED BY POSTERIOR 
PROBABILITIES ONLY: THE CASE OF INDEPENDENT 77 MATRIX 

Let jt; and A be the j-th column of 77 and the /c-th column of 77' respec- 
tively. Now, Jr j and n' k can be interpreted as m-dimensional vectors or points 
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in ra-dimensional space. Accordingly 77 and 77' can be interpreted— whenever 
it is convenient— as two sets of vectors: 77 = { 7 ^, , tt» }, 77' = {;:(•, 

Let S m ~ l be the simplex defined by the set of all points v = {v u • ■ • , v m ) such 
that Vi ^ 0 , 2 — 1- Clearly both 77 and 77' are subsets of S m ~ 1 . 

Define the following condition: the set 77' is contained in the convex hull 
K(II) of the set 77: 

Condition {K)\ n f dK{lJ). 

This is represented by Figures 2a, 3a, 3b, 4, but not by Figures 2b, 3d. 
Clearly (K) is equivalent to the following condition: there exists a matrix M 
such that 

(9.1) lG^ (r) ; 77' = TIM : 

(9.1) is identical with conditions ( 0 O ) and (00 of Section 8 . Hence ( 0 ) which 
requires, in addition, (0 2 ), implies but is not implied by (K). We have thus 

Theorem 9.1. (0)1— (if). 

In this section we shall consider the case when the columns of the matrix 
77 (but not necessarily of 77') are linearly independent, that is: 

Condition (I*): (rank of 77) = n. 

This condition will be later shown to apply in two important cases: the 
binomial case (n = 2) of Section 10 and the noiseless case of Section 11 . 
Similarly we introduce 

Condition ( h ): (rank of A) = n . 

Noting that the r* and < 7 , are all > 0 it is easy to prove 

Theorem 9.2. (7*) — (I*). 

Now we shall prove 

Theorem 9.3. If conditions (/*) and {6) {hence also condition {B)) are true , 
c/iew (75) and {0) are satisfied by a unique pair of matrices 75, 6 related by the 
equations 

(8.23) d jk q'k = j3jA-<D > M 3, k . 

Proof. By the hypothesis, we have 

(00: 77' = 770, (BO: A f = AB , 

and 77 consists of linearly independent columns. Then by Theorem 9.2, ^ 
also consists of linearly independent columns. Therefore, in the above rela- 
tions (00 and (BO, 0 and B are unique. Now in proving Theorem 8.2 we 
have shown that if a matrix 75 satisfying condition (B ) exists then a matrix 
0 defined by equations (8.23) will satisfy condition ( 0 ); and conversely. Hence 
the unique pair B , 0 satisfying (B), ( 0 ) respectively must also satisfy equations 

(8.23) . Stronger than Theorem 9.3 is 

Theorem 9.4. [(70 and (75)] <=* [(B) with unique B] *=> [(0) with unique 0]. 
Proof. By Theorems 9.2 and 9.3, it is sufficient to prove that, when (B) 
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Figures 2, 3, 4 

SIMPLEX REPRESENTATION OF POSTERIOR PROBABILITIES 


Itlj, 7T2fr 




Figure 2 
m = 2 

Y, Y' binary: m = 2 = n = n' 



Figure 3 
m = 3 



m ^ 2 

Y BINOMIAL: n = 2 

holds and B is unique, then (I A ) holds. Suppose (B) holds, with unique 
B = Wjk ] but (I A ) does not hold. Let g < n and let X u • • •, X g be linearly in- 
dependent, while Zg+ lf are linearly dependent on {x u That is 

g 

(9.2) Xh — 2 diuM , h = g -f 1, • • • , n . 

i=i 

By condition (B) and (9.2) 
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(9.3) Ale = 2 "h 2 a hl@hk)Xi , k = 1, • • • , n' . 

i = l Z=1 \ A=ff-rl / 

Let us define a matrix J = [5^-] as follows: 


(9.4) 

Ojl — Pn + £ j 




(9.5) 

1 

° JS ~ & JS n' — i ’ 


s = 2, • j=l, 

•,7l , 

where sj, j = 1, • 

• • , n are numbers such 

that 



(9.6) 

n 

£l -f- CLhlZh = 

0, 

1 = 1. • 

“,0 • 


h=g-rl 


We shall prove our theorem for the case in which all fij k > 0. Clearly it is 
possible to choose all ej, j = 1, • • • , n w T hich satisfy (9.6) and have arbitrarily 
small absolute values. Then, for Pj k > 0 and \bj\ sufficiently small, we have 
by (9.4), (9.5) 

(9.7) 8 jk ^ 0, S = 2 = 1 • 

k k 

By (9.4), (9.5), (9.6), 

(9.8) Olk + a hldhk = 01k + a hl0hk • 

h h 

Therefore by (9.2), (9.3), (9.8), 

(9.9) Xk = ^ . 

1=1 \ h J j = 1 

Then it follows from (9.7) and (9.8) that A e A ^ B y and A f = AA. The part 
of the theorem involving 6 is proved similarly. 

We shall now show that, under condition (I*), if Y' is obtained by garbling 
T— see Section 6— then the matrices B and 6 are identical, respectively, with 
the matrices r = [y jk ] and Z 7 ' = [y'j k ] defined in Section 3: 

rjk = viy’k I yi)-, r'ik = p(yj I y'k ) . 

Theorem 9.5. [(/,) and (G)] =>{B= r, d = r f ). 

Proof. By Theorem 6.5 and 8.2, condition (G) implies that 
A' = Ar, A' = AB t W = no . 

Hence /i(r — 5) = 0. Let (I r ) hold. Then by Theorem 9.2, ^4 consists of in- 
dependent columns, and therefore r = B, i.e.: 

rjk = p(y’k I yj) = Pik , all j, k . 

By Theorem 9.3, 6 is uniquely defined by 

Ojk = pjkQjlq'k = p(y'k I yj)p(yj)lp(y'k) 

= p(yj I y'k ) ; 

hence 

[@jk] = [“{ jk\, d = T f . 
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We shall now prove that (r) does not imply (G), nor does (B) imply (r). 
Condition (P) means 

(9.10) p(y' \z)=^p(y\ z)p(y' | y) . 

V 

On the other hand we have identically 

(9.11) p(y ' | z) = £ p(y, y' 1 z) = £ P(V I z )v(v' I z, y) • 

y v 

If P is linearly dependent, then by Theorem 9.4, equations (9.10), (9.11) are 
consistent with the inequality 

p(y f I y) * v(y' I z, y ) , 

as contrasted with (Gi). Hence (G) does not necessarily follow from (H. 
Similarly we can prove that (B"> does not imply (T). 

We shall now show that the independence condition (I T ) is both sufficient 
and necessary for the equivalence of the conditions (P), (II) and (K). We 
shall start with 

Theorem 9.6. (/*) — ((K) ~ (P)). 

PROOF. By Theorems 8.3 and 9.1, we always have (P)=>(K). Therefore 
we need only prove that (K)=>(P) if the linear independence condition (I*) 
holds; i.e., that 

(9.12) (Is) =* ((K) => (P)) . 

Let Y and Y' be two information systems with known P r = ( [IT *, q *) and 
P Y ’ = (Pi, qi) respectively, where by (3.2) 

(9.13) P*q* — P*q* . 

Suppose (K) holds; that is, there exists a matrix M satisfying (9.1^ with 
n = P*, P' = Pi. Then by (9.13) 

(9.14) P*(q* - Mqi) = 0 . 

If (I*) holds for P*, then, the linear independence of its columns implies 

(9.15) q* = Mqi , 

so that, by (9.1), all three components of condition ( 0 ) hold. Hence by 
Theorem 8.3, (P) holds. This proves that 

(9.16) (I,)-((K)-(P)). 

We shall now prove 
Theorem 9.7. (J*)=-((P)~(P)). 

Proof. Since by Theorem 5.1, (II) always implies (P), we need only to 
prove that (P) implies (11) if (I*) holds; i.e., that 

(9.17) (J Zff)-((P)=*(P)). 

First note that since (P) is equivalent to {0) by Theorem 8.3, condition (IT) 
is by definition equivalent to the following: for any q, q' such that 
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(9.18) 77 *g — 77*g' 
there exists a matrix 0 such that 

0 g r > , 

(9.19) 17'* = 77*0 , 

q = 0q ! . 

The last three conditions correspond to the three components of condition 
(0), applied here to all q, q f consistent with (9.18), and not only (as in (9.13)) 
to a particular q*,q*. 

Suppose (P), or equivalently (0), holds for the known (starred) distributions; 
that is, there exists a matrix 0* (say) such that 

(9.20.0) 0* G , 

(9.20.1) 77'* = 77*0* , 

(9.20.2) g* = 0*g* . 

Thus, under condition (P), there exists at least one triple (g, q', 0) which 
satisfied (9.19), viz., the triple (g*, g*, 0*). Now let g, g' be undetermined 
except for the constraint (9.18). Then by (9.20.1), 

(9.21) 77*(g - 0*g') = 0 . 

If the columns of 77* are linearly independent then (9.21) implies q = 0*g'; 
i.e., (P) then implies (II) with 0 = 0*. Therefore 

(9.22) (7*)=*((P)=*(77)). 

This completes the proof. 

Combining Theorems 9.6 and 9.7, we have 

Theorem 9.8. (/*) =* ((K ) «=> (P) — (77)). 

Example. Let m = n ~ n' — 3. Figure 3c shows a sequence of triangles 
(with vertices representing the matrices 77, IT , 77'', 77°), nestling in each other 
and containing the point r; the largest triangle coincides with the simplex 
itself, and the sequence shrinks to the point r as its lower bound. The 
vertices of any of these triangles represent the set 77 = {n l9 ir 2 , tt 3 } of some 
information system Y, and the triangle itself is Kill). If the triangle cor- 
responding to Y' is nestled in the triangle corresponding to Y, then 77'cif(77); 
and since the vertices are not colinear, condition (I <T ) is satisfied. Hence by 
Theorem 9.2, 77 > 77': the system Y is more informative than Y'. as revealed 
by the posterior probabilities alone. Note that the vertices of the largest 
triangle in the sequence, i.e. of the simplex itself, give perfect information; 
while the lower bound represented by point r is just as informative as 
“null information,” i.e. the knowledge of the prior probabilities alone, with- 
out any messages. The sequence is ordered according to informativeness, and 
so is any other sequence of nestling triangles. The set of all triangles is a 
lattice, partially ordered by the relation “more informative than,” with “com- 
plete information” as the maximum element, and “null information” as the 
infimum. 
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10 . BINOMIAL AND BINARY INFORMATION SYSTEMS 


The numbers m, n, n' are always ^ 2. Following Blackwell and Girshick 
[4] the case m ~ 2 will be called dichotomy . If information system Y consists 
of 2 messages, n — 2, it is called binomial ; and when m — n — 2, the infor- 
mation-theoretical literature— e.g., Feinstein [7]— calls the channel (the in- 
formation system) Y, binary . We shall denote these conditions thus: 

(d): m = 2; (b): n — 2; (bO: n' — 2 . 

Thus a binary channel is defined by (b) and (d). 

Since always m ^ 2, condition (b) implies that the columns of the matrix 
17 are linearly independent. Thus (b) implies (I*), but of course, not con- 
versely. Hence by Theorem 9.8 

Theorem 10.1. ( b ) h* ((#) ~ (II) (P)). 

The case is illustrated on Figure 4, where the points i , /r 2 representing ~Y, 
and the points n[ f icon's representing Y' are all arranged on a straight line 
in a space of m dimensions (arbitrary m ^ 2), and all n' k lie between ir l and zr 2 . 

The more special case n = n' = m = 2— i.e. (b), (b') and (d)— occurs in the 
testing of simple hypotheses and in comparing “binary channels.” After 
labelling the two events z x and z z arbitrarily, let us label the two messages 
in Y so that 


( 10 . 1 ) 


TTll i= 7^12 • 


Then condition (K) takes the form 


( 10 . 2 ) 


TTli ^ 7 T 11 ^ 71 12 j 
K\\ ~ 7Ti2 ~ 7Ti2 , 


and by Theorem 10.1, we have 

Theorem 10.2. Let Y and Y' be two binary channels (i.e., let m = n = n' — 2); 
then Y is more informative than Y' if and only if the posterior probabili- 
ties satisfy the relation (10.2); provided the two messages in Y are labelled 
so as to make x n }> n l2 . 

This criterion (usable only when the posterior probabilities are known) is 
simpler than the one provided by Blackwell-Girshick [4, (section 12.5")] and 
based on the likelihood matrix A. 


11 . COMPARISON OF NOISELESS INFORMATION SYSTEMS 

Let Y be a noiseless information system, i.e. ZsY (see Section 2). Then 
hj = p(yj\Zi) = 1 or 0, all i, j. In general, V. ^ = 1 and ]T,. rAij = qj > 0. 
Therefore, if Y is noiseless, then each row of its A matrix contains one and 
only one non-zero (= 1) element and each column contains at least one non- 
zero ( = 1) element. Now — rAijlqj. Hence, whenever X-,j — 1 or 0, then n,j = 
Tiiqj (> 0) or 0, respectively. Therefore; 

If Y is noiseless , then each row of its IT matrix contains one 
(11.1) and only one non-zero (> 0) element and each column contains 
at least one non-zero (> 0) element. 
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Using (11.1) we shall prove the following property of condition (N) defined 
in Section 3: 

Theorem 11.1. (N)\=>((I«) and ( 1 , 0 ) - 
PROOF. Suppose Y is noiseless and let 

(11.2) S gjTtis = 0 , all i = 1, • • • , m , 

j=i 

for some real numbers g i, • • •, g n . We shall then show that g 3 = 0, for all j, 
i.e., the columns of 77 are linearly independent. We note that by (11.1), for 
any fixed j , say j = j u there is at least one i, say i = i lf such that >0. 
Then by (11.1), — 0, for all j =£ j\. Therefore, for i = i x , (11.2) becomes 

gj 1 xi 1 j 1 = 0. Since 7?*^ > 0, we have 

(11.3) ^ = 0, ji = l, 

that is, (I,) holds. Similarly, for Y' noiseless, we have ( 1 , 0 . The converse 
obviously does not hold. 

Using Theorems 9.7 and 11.1, we have 

Theorem 11.2. (AT) ((P) ~ (77) ~ (X)). 

In Section 2, we defined condition (C); and by Theorem 7.3, (C)n*(P). 
Hence by Theorem 11.2, we have 

Theorem 11.3. (AT)N-((C)=>(/7)<-(P)<=>(iC)). 

Thus, if both Y and Y' are noiseless, and Y is finer than Y', then Y is 
more informative than Y', as revealed by posterior probabilities of events 

(11.4) 77' = TIM, M = [ Mjk ] e , 

which is, of course, our condition (K). 

Remark 1. The matrix M = [/iit] in (11.4), with ZsYsY', 77' = 77A7, has 
the following interpretation: 

M=r = [ r ' jk ] = [p( yj |yt)], 

since by Theorems 11.1, 11.3, 7.1, 9.5, we have 6 = T 7 ' = AT. 

Example . Let YsYsY', so that by Theorem 11.3 and Remark 1, 77' = TIM , 
A/* = p(t/i | yi); let the posterior probability matrices be as follows (with 

0 < 7r(i = 1 — 7Tai < 1); 



77 


77' 

1 

0 

0 

7^ 0 

0 

1 

0 

7T21 0 

0 

0 

1 

0 1 


Then y[ = y i U i/ 2 , 2/2 = ?/3. Since 77 is the identity matrix 7, we have 77' = 
TIM = IM — M = T 7 '. The case is illustrated in Figure 3b, where the point 
7r'i divides the base of the triangle in proportion nuinh = p(y z \y[): p(yi\y{). 
Now we shall prove the following 
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Theorem 11.4. (. N ) ((C) «=* (K ) <=> (77") <=> (P)), w/tew X is finite. 

Proof. By Theorems 11.2 and 11.3, it will suffice to prove that 
(N) m ((K) => (C)), when X is finite. 

Let us assume (N) and (K): that is, ZsY, ZsY f and (11.4) holds, so that 

(11.5) Kik = 2 nijjuLjk , all j, k . 

2 

For any fixed fc, say k = k u let z^eZ be contained in y k (remember that 
Z is a sub-partition of the noiseless information system Y'.) Then Ii lkl = 1, 
and 

( 11 . 6 ) n i\ k l > 0 . 

Since Y is noiseless, by (11.1) the i^th row of 77 contains one and only one 
non-zero element, say n il j l > 0. Then 

(11.7) 7 > 0, = 0 , for all j ^ j\ . 

In (11.5), let i = i v and k = ku then by (11.7), 


(11.8) 

7Tii&i y 

hence 


(11.9) 

Mh*i > 0 » 


by (11.6), (11.7), (11.8). Suppose X is finite. Then > 0, (i.e., X il j L = 1) 
implies z il c y^ t therefore, (11.7) admits of the following interpretation: 

^ ^ For any fixed k = k v and any z h c there 

exists at least one yj x such that z iL c yj L . 

Now, for this yj v let z if eZ be contained in yj v Then = 1 and 

(11.11) > 0, 7ti,j — 0 , for all j =£ j\ , 

by (11.1). Then, for i = i f and k = fci (11.5) becomes 

(11.12) 7T^ fcl = , 

so that by (11.9), (11.11) and (11.12), 7c ifkl > 0, ^ kl = 1. When X is finite this 
implies z if c y’ kl . This proves that 

(11.13) For any y ^ such that y^ C\y kl ^ 0 y^ c y' kl . 

From (11.10) and (11.13), we conclude that, if (N) and (K) hold and X is 
finite, then any y k is a union of disjoint y's , i.e., YsY', viz., (C) holds. 
This completes the proof. 

Remark 2. Let Y be noiseless: If the columns of H ' are linear combi- 
nations of 77, that is, if there exists a n X n' matrix A = [ay fc ] such that 
77' = IT A f i.e., 

(11.14) 7t' ik = 2 xij a jk , all i f k , 

2 

then Ae^/^nxl'] hence (K) holds. 
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To prove, sum both sides of (11.14) with respect to i; then ^jOCjk = all 
k (since = St == D- It remains to prove that a jJe ^ 0 for all j , fc. 

Fix j = ji, k = fci, and let i = i\ be such that > 0. Since Y is noiseless, 
we have by (11.1), (11.14) 

~ Xi ih a h*i * 

Since > 0, if K r ilkl > 0 we have aj L kl > 0; and if = 0, then aj lkl = 0. 
Hence aj k ^ 0, all j, k , completing the proof. 

12. CONVEX OPERATORS ON POSTERIOR PROBABILITIES: 

THE EQUIVOCATION PARAMETER 

For any payoff function w, we shall define a function on the simplex 
S m_1 (i.e. f on the set of all vectors ? = (&, • • *, <? m ) such that ?» S 0, & = 1) 
as follows: 

(12.1) y(£) = max 2 <«(*«» d)& • 

deD» i=sl 

Then by (4.3), we have 

(12.2) U(P;<o) = ±qM*j), 

j = i 

where n-j is the jf-th column of 77. Consider a pair of probability distri- 
butions (77, q), (77', q') with the following property: 13 
Condition (9): For any convex function (p on S 7,1-1 

(12.3) S flwfa) ^ ii ?*?>(**) . 

fc=l Jfeal 

We shall prove 14 
Theorem 12.1. (P) ~ ( 9). 

Proof. First we shall prove (9) =» (P). We have by (12.1), for any two 
vectors = ($ n , • • •, $ ml ), £ 2 s (£12, • • -y$mz), both in S m_1 , and letting 0 ^ a ^ 1, 

^o>(a?i + (1 — a)£ t ) ^ a max m(Zi, d)6i + (1 — a) max V, wfo, 

( 12 . 4 ) dGD* i deD“> i 

= ay»(?i) + (1 — a>co(9 2 ) . 

That is, is a convex function on S m-1 . Therefore, using expression 
(12.2) for U(P; (o) and a similar one for U(P';<o) t condition (9) is seen to 
imply that U(P;a>)}£ U(P';c w), all w, i.e., condition (P). 

13 In DeGroot’s [6] definition of an information amount I[Y; r\ 9] it is reasonable to 
require that I[Y, r; 9] > 0 for all Y and r. He proves that an (uncertainty) function 
9 is concave on S m_1 is a necessary and sufficient condition for that. 

14 Compare also Blackwell-Girshick [4]. While that book is restricted to finite sets 
of messages and of events, earlier work of Blackwell [2. 3] extends also to cases when 
the set of messages are continuous. Accordingly our conditions (B), (W), (9), all 
equivalent to (P), are generalized into (B), (W), (9), say. Blackwell [3] proved that 
(B) => (W) and also (B)=>(c); compare also DeGroot [6]. We do not know whether, 
conversely, corresponding to our theorems for finite information systems, one also has 
(W) =* (B) and (9) =* (B). 
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Next we shall prove (P) =>(£>). By Theorem 8.3, (P)<=>(0). Therefore, if 


(P) holds, 

then there exists 


such that 

II 


(12.5.0) 

Oik so, X On = 1 - 

j 


(12.5.1) 

Kk — ^ KjOjk t 

j 

all k , 

(12.5.2) 

Qi = x On q'k , 

all j . 


k 


Hence for any convex function <p on S m ~ l , we have 


( 12 . 6 ) 


i k i 


S X 9kv ( X QikXj) = X Qk9(.~k) > 

k \ j / k 


hence (P) ==> (^>), completing the proof of the theorem. 

Let (p* be a particular convex function on S m_1 , and suppose 


Condition (^*): Qj<p*(xj) ^ 2 0*9*0^) • 

j=l k = 1 

Clearly (y>*) cannot be a sufficient condition for (P) to be satisfied: for (P) 
induces only a partial ordering on the set of information systems while (c?*) 
induces a complete ordering. At the same time, by Theorem 12.1, (©*) is a 
necessary condition for (P). Thus 


Theorem 12.2. (P) ~ (<p) n> O*). 

An important special case is that of the equivocation parameter H(Z\Y) 
of classical information theory, defined in (5.1) or, in another notation, by 

(12.7) H(Z\Y) = Qj £ ^ log r,/ . 

The function 

i 

where g = (&, • • *, f TO )e S m ~ l , is well known to be convex. 15 Therefore, if we 
introduce 


Condition (H): H(Z | Y) ^ H(Z | Y r ). 
we can replace, Theorem 12.2, (<p*) by (H) and obtain 
Theorem 12.3. (P)n*(£T): 

i.e., lower equivocation is necessary but not sufficient for higher informa- 
tiveness. 


13. SOME RELATIONS WITH BLACKWELL’S RESULTS 

Using our terminology and notation we can say that in Blackwell [2], an 
information system Y is defined by its likelihood matrix A without referring 


15 Since d 2 (£; log $i)}d$] = 1 /£» > 0 for non-negative fr, the matrix of second deriva- 
tives of H(£), which is diagonal, is positive definite— a suffient condition for convexity. 
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to the probability distribution on Z . Given an information system Y and a 
payoff function co, we shall define a set W% in ra-dimensional space as follows. 
Let D 03 be the set of decisions related to oj and let be the set of all 
decision rules o from Y to £K Corresponding to each ded%, define a point 

(13.1) w(d) = MS), * • •, co m (o)) 
in ra-dimensional space by 

(13.2) C 0 i(d) = 2 °*( z i’ Xy))p(y I zO , i = 1, • • • , m . 

y 

Then the set W$ is defined as follows 

(13.3) Wf = {w(d);deJ%} . 

The set is defined similarly. Assume that both and W^, are closed 
sets. In BlackwelTs definition, 16 Y is said to be more informative than Y ' , 
if the convex hull of Wp, K(W%), contains W$, for all co. We shall call 
Condition (W): K(W$)^> W"/, for all 
Blackwell [3] and Blackwell-Girshick [4] have proved 

Theorem 13.1. (TP)~(JS). 

From this theorem and our Theorem 8.3, we immedicately obtain 

Theorem 13.2. (P) ~ (A) ~ (T7) *=* (B) <=> (0). 

Here we shall give a direct proof of the equivalence of three conditions (P), 
(A) and (W) without going through condition (B) 

(13.4) (P)-(A)~(W). 

First note that U(P ; <w) can be expressed by means of the set W% defined 
in (13.3) 

U(P; (o) = £704*, r*; co) = max S <"(**» d (y))P(y I *<) 

<5ej£ i 0 

= max Yi r*iWi(p) , where P = (A*, r*) , 

56 » 

(13.5) 17(P; <o) = U(A *, r*; co) = max £ , where w = (wi, •••, w m ) . 

weW% i 

Similarly, we have 

(13.6) U(P'; w) = U(A *, r*; co) = max J] r^Wi , where F = 0**, r*) . 

Suppose (W) holds, i.e., W?,*, for all co. Then for any r, 

(13.7) max 2 nti>i ^ max 2 W* » for all co . 

wetv" i wzWy, i 

By (13.5) and (13.6), the relation (13.7) is equivalent to 

(13.8) £70**, r; co) ^ £704*, r; co) , for all r, co . 

Hence (W) implies (A). And since clearly (A) =* (P), we have proved that 

16 A definition of more informativeness by condition (W) was originally proposed 
by Bohnenblust, Shapley and Scherman [5]. 
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(W) => (A) =» (P). 

Next, we shall prove that (P) implies (W); that is, if 

(13.9) u(A*r*; a>) ^ u(A*, r*; <o) , for all w , 

then (W) holds. (The proof is similar to that of Theorem 8.1.) Let (g u • • •, g m ) 
be an arbitrary set of m real numbers; define 

(13.10) hi = gilr*i i=l,*-,m. 

For any payoff function w, we define a payoff function a) by D s = D" and 

(13.11) to (z if d ) = hi(o(Zi, d), d e D w , i = 1, • • • , m . 

Then it is clear that 


(13.12) wt(8) = Yi d (y))p(.y i z 0 = hiWiio ) , i = 1, • • • , m , 

V 

where 3eA^ = A“>, and Wi(d) is defined by (13.2). 

Let us define a linear transformation T from m-space to ra-space by 
Tw = T(w lf • • • , w m ) = (ftiWi, • • •, h m w m ). Then from (13.12) we have it’(.o) = 
(wi(<5), • • • , w m (d)) — Tw(8), for all deA” = A'f. Therefore W l “ defined by (13.3) 
and W™ defined similarly with respect to id are related by 

(13.3) W” = TW“ . 

From (13.5) and (13.13), we have 


(13.14) 


{704*, r*; to) = ma 

w e 

= max Yi r*i{Tw)i 

W 6 W^r 


where ( Tw)» is the -i-th coordinate of the point Tw, that is, from the defini- 
tion of T and (13.10), 

(13.15) ( Tw\ — hiWi = — Wi , i = 1, • • *, m . 

Therefore, from (13.14) and (13.15), we have 

(13.16) E7(/4*, r*; 5) = max 2 gwi . 

Similarly 

(13.16') U(A'*, r*; £) = max 2 ffWi • 

weir", i 

From (13.9), (13.16) and (13.16') we have 

(13.17) max ^ ^ max 2 > 

weir" * «6irJ * 

where (gh, - " t gm) and a) is any set of m real numbers and co is any payoff 
function. Therefore (13.17) implies 


K(W$)=*W% , for all <o , 

that is, (W). Accordingly (P)=»(W) is proved. This completes a proof of 
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the equivalence relation (13.4) which includes Theorem 5.2 as a part of it. 


14. dichotomies: m = 2 


In the dichotomy case, Z consists of two events Zi and z 2 : m = 2. As in 
(9.19), let 

(14.1) H*q = n*q ! (—r ) , 

where 77*, 77*, hut not q,q',r are fixed. Henceforth we shall omit the 
asterisk without ambiguity. 

With respect to an information system Y , define two functions G r (t) and 
F r (t) on [0, 1] as follows: 17 

(14.2) G Y {t) = £ . 

(14.3) Fr(t) a j f Gfr(«)d» . 

Gr/(i) and Fr>(£) are defined similarly with respect to Y*. Clearly G r (t) is a 
monotone non-decreasing step function with a jump of qj at t = j = 1, • • •, n. 
From the consistency condition (14.1), 

F r (l) = i G Y (t)dt = 1 — 2 = 1 — n = r 2 , 

Jo j 

Fy»{t) = ( Gy,{t)dt =1 — 2 tfW* = 1 - ri = r 2 . 

Jo fc 

Therefore 

(14.4) F r (l) = Fy,(l) = r 2 . 

It will become clear that the relation (14.4) is a key point in our proof of 
Theorem 14.3 below. Now we shall introduce 

Condition (F): F r (t) ^ F Y ,{t\ all t, 0 £ t £ 1. 

Then, following a reasoning similar to that of Blackwell-Girshick [4, 
(theorem 12.4.1)], it is easy to prove 18 


17 Compare Gy(f) and Fy{t) with Fp(t) and Cp(t) in Blackwell-Girshick [4, (theorem 
12.4.1)]. If n = ra = 1/2, then our Gr(t), Fy(t) become equal to Blackwell-Girshick’s 
Fp(t), Cp(t) respectively. 

18 We shall show that Theorem 14.1 can be proved by almost the same device as 
given by Blackwell-Girshick [4] in their proof of their Theorem 12.4.1. When m = 2, 
condition ( <p ) is equivalent to the following condition: For any convex function on 
[0,1], 

n n f 

Z qj<p(mi) ^ Z qk'<p(nik) . 

j=i fc=i 

Defining a function ft(u) on [0, 1] for each t, 0 ^ t ^ 1, by 



for 0 

for 0 S u S 1 , 


any (continuous) convex function (p on [0, 1] can be uniformly approximated by a 
function of the form ( Continued on next page) 
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Theorem 14.1. When m = 2, (?) <=* (F). 

Then by Theorems 12.1 and 14.1, we have 
Theorem 14.2. When m = 2, (77) *=> (F). 

Referring to the definition of points i ry, in S m ~ l , we shall introduce the 

following condition for m = 2. 

Condition (c): All points n r k of 77 ' lie between two consecutive points of 
77. (See Figure 5a). 

On Figure 6a, the graphs of the functions Grit ) and Gy.it) are drawn on 
the same interval [0, 1]. The interval [0, 1] is partitioned into sub-intervals by 
points K n ,- • *, and tt'u, • • *, Among these sub-intervals, the ones w T here 

(14.5) Grit) > Gy.it) 

shall be denoted by I+\ 7+ 2) , ••*, and the ones where 

( 14 . 6 ) Grit) < Gr'it) 

shall be denoted by Ji 1 *, 7i 2) , (Note that Grit) — Gy.it) and Gy.it) — Grit) 
are constant positive numbers on 7+° and I~ ] respectively.) The length of 
those intervals will also be denoted by I+\ I-\ respectively. 

Define the positive real numbers A (1) , A (2) , ••• and 7? rl) ,7? (2) , • • • by 

(14.7) = Ii a \Grit) - Gy.it )) , t e Il a) , 

(14.8) Jjw s I^iGy.it) - Grit)) , t e Il? ] . 

Then by (14.4), 

(14.9) A'" + AM + . . . = BM + BM + . . . . 

The following theorem gives a simpler characterization of "greater infor- 
mativeness” than Blackwell-Girshick’s Theorem 12.4.1 [4]. 

Theorem 14.3. When m = 2, (77) <=* (c). 

PROOF. Without loss of generality, let n n < n l2 < • • • < n ln and n[i < n[ z < 


g(u) = 2 Csftgiu) + au + b , 

8 

where c s ^ 0. Now without loss of generality, we assume that 

K U < n 12 < *“ < K ln an( ^ K ll < K 12 < *“ < K 1 n * 

Then it is clear that we have 


Then 


Frit) = £ Qjftimj), Fy(t) = £ sri/,(>4> . 

jal fc = l 


2 qjgin u) = 2 c s Fr(W + an + b , 

i 8 


2 = 2 csFy.its) + an + b , 


since 


2 giw !j = 2 = r . 

Therefore condition (y>.) is equivalent to condition (F), since c a ^ 0. 
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6a 


Figure 6 


6b 


••• <7t[ n ,. First, we shall prove (c)=*(n). For a fixed j, we have by- 
condition (c) 

(14.10) TTij ^ 7T1! < * * * < n[ n f ^ 7T u >i . 

Let q , q' be any vectors such that (14.1) holds. Then GrOrij) < 1 = G y / K 7r[ n 0- 
Therefore, there exists an index k such that 

(14.11) Gr'Grib-0 ^ GA*u) ^ GM) , 

(where, when k = 1, we consider only the right-hand inequality). Then from 
the monotonicity of G r (t) and Gr>(t) (see Figure 6b) 

Gy(t) ^ Gyt(i) , 0 < t < n[k , 

Gyt(t) 5s Gy(t) , TTifc ^ t ^ 1 . 

Therefore, all intervals /i 0 , /| 2) , ••• are located on the left-hand side of n[ k 


This content downloaded from 128.197.26.12 on Mon, 27 Jun 2016 09:07:40 UTC 
All use subject to http://about.jstor.org/terms 





INFORMATION SYSTEMS 


171 


and all the intervals \ ••• are located on the right-hand side of n[ k . 

Therefore by (14.9) and the definitions of F y (t) and Fy(t), clearly, condition 
(F) holds. Then by Theorem 14.2, it follows that (c)=>(Tl). 

Next, in order to prove that (Il)=»(c), suppose that (c) does not hold. 
Then there exists at least one pair of points among 7rJ If • • •, 7r' ln -, — say n[ k and 
such that the interval {n[ k , nlk-i) contains at least one point of {tth, • • • , 7r lw } 
—say Xu, • • • , TC\j-rsi s^O. We shall show that there exists a pair of vectors 
q = (qi t • * , q»), Qj > 0, ^ <27 = 1 and q* - (tfl» * * * , Qk > 0, 2 q k = 1 such that 
(14.13) Tlq = Il'q' , 

and for the two corresponding information systems Y and Y' , the respective 
functions F y (t) and F r *(t) do not satisfy condition (F). Then by Theorem 
14.2, we can conclude (77, q ) > (77', q '), i.e., (II) does not hold. 

We can choose q = (# lf • • •, q n ) and <?' = (qi, ••*, qv) so as to satisfy (14.9) 
and the following conditions (14.14.1)-(14.14.5): 


(14.14.1) 

0 < Qn < Qh , 

g = l. • • • , i - 1, j + l, • • • , j + s - l, 

j + 8 + 1 , •••,«. h — j, j + s , 

(14.14.2) 

= 1 

< 7=1 


(14.14.3) 

0 < q' g « q' h , 

g — 1, • • • , Jc — 1, Jc + 2, 

h= k, k + 1 , 

(14.14.4) 

h = 1 


(14.14.5) 

i— i k j+s 

S 9* < £ 9ft < £ 9® > 


g=L h—L g—1 


where for two positive numbers a and b, a < b denote that a is infinitesimally 
small compared to b. Now, in numbering the intervals 7+° and I- ) , let us 
define ll l) and 7i 1) by I+ ] = ttia+i], 7~ } = [ttIa, Then 

(14.15) AM s (jt{*+i ~ TTU+sXGrit) - Gy(t)\ Xlj + s ^ t ^ n[ k + i , 

(14.16) B< 1) = (ffii ~ nlkXGAt) - Grit)) , ^(/c ^ t ^ 7TU . 

Then in addition to (14.9), by (14.14.1) and (14.14.3), we have 

(14.17) A< 2) + A (3) + ••• « A* 11 , 

(14.18) B (2) + £ (3) + ••• < 2? (1) . 

Then by (14.9) 

(14.19) - £< 1} = (£< 2) + £ (3) +•••)- (A (2) + A (3) +•••)• 

Therefore, by (14.17), (14.18), A (1) — B {1) is almost zero. Accordingly by f 14.17) 

(14.20) A (2) + Aw + ••• <C Bw . 

This implies that we cannot have Frit) ^ F y \t) over the whole sub-interval 
(n[ k , TZij). We have thus shown that if (c) does not hold , then (F) does not 
hold and by Theorem 14.2, (77) cannot hold. Hence (Tl)==>(c), completing the 
proof of the theorem. 
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15. SUMMARY 

Given the set X of states of nature, described in unlimited detail and 
with a probability measure defined on it, we consider its partitions: Z (set 
of events z) and Y , Y', Y", ••• (information systems, i.e., sets of mutually 
exclusive messages denoted by y in Y , y f in Y r etc.). The set Q z consists of 
all payoff function co for which Z is a payoff -adequate (i.e., not too coarse) 
partition of X. Let the set of feasible decisions d entering the argument 
of co be D”; thus co is a real valued function on Z x D a . The (gross) 
expected payoff of an information system Y is 

U(Y;w) = ^Pr (xezny)-w(z, d y ) , 

z,y 

where d y is a decision responding optimally to message y ; that is, 

2 Pr (x e z n y)[co(z f d y ) — w(z, d)] ^ 0 , all d e . 

z 

A class of partially ordering relations is defined on the set of information 
systems Y, Y f , •••, as follows: Y is said to be more informative than Y' , 

as revealed by a given property of the distribution on Z X Y X Y r , if this 
property implies that 

U(Y;a>)^ U(Y';a>), all weO M . 

In particular, greater informativeness of Y relative to Y' may be revealed 
by comparing the (bivariate) distributions on Z X Y and on Z X Y'\ by 
comparing the matrix of likelihoods A = [X zy \ = [Pr(xezC\y)IFr (xez)], with 
the correspondingly defined matrix A'; or by comparing the matrix of posterior 
probabilities, 77 = [7r z2/ ] = [Pr (x e z n y)l Pr (x e y)\, with the correspondingly 
defined matrix 77. The latter comparison is the strictly stronger one 
(Theorem 5.3) except in the important case when the messages of the more 
informative system (each message being represented by a column of 77, or 
of A ) are linearly independent: Theorems 9.5, 9.7. The numerical criteria 
for comparison of informativeness, given in Theorem 13.2 are all equivalent, 
and are strictly stronger than that of Theorem 9.1; but they are all equivalent 
in the case of linear independence between messages. This case applies in 
particular to information systems that are noiseless with respect to— i.e., are 
sub-partitions of— the set Z of events (Section 11); or are binomial— the case 
of two messages,— or binary— the case of two messages and two events 
(Section 10). Furthermore, for the “dichotomy” cases (i.e., the case of only 
two events), a certain property of posterior probabilities is proved to be 
necessary and sufficient for Y to be more informative than Y' (Section 14). 
The “equivocation” parameter of classical information theory, as well as its 
generalization to any other convex operator on posterior probabilities, is 
shown to be a necessary but not sufficient criterion of comparative informa- 
tiveness; but a stronger criterion involving all convex operators on posterior 
probabilities is shown to be both sufficient and necessary (Section 12). 

Statements on comparative informativeness can also be derived from the 
knowledge, not of the numerical properties of the probability distribution 
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of events and messages, but of the process by which messages constituting 
one information system are generated from those of another system. Thus 
it is shown that “garbling,” of Y into Y', a well-defined property of some 

distributions on Z X Y X Y’, can never make Y ' more informative than Y. 

A special case of garbling is “collapsing” information, and a still more 
special case is disjoining” it: Theorem 7.3. Moreover, in the case of 

garbling of linearly independent messages, the Markov matrices used in 

some of the criteria of informativeness and relating the distribution on 
Z X Y to that on Z X Y' receive an insightful interpretation in terms of 
conditional probabilities of the messages of one system, given the message 
of another: Theorem 9.3. 

To use our results for a definitive economic comparison of information 
systems one would have to know information costs. Unless these are additive, 
one would have to discard the concept of gross payoff functions (Section 4). 
One should, moreover, characterize the feasible set of decision rules 
(rather than the set D m of feasible decision), and take account of the cost 
which is associated with a decision rule; this cost, too, is in general non- 
additive, thus calling for a further re-definition of (net) payoffs. 

In addition to these open questions of an economic nature, the mathemat- 
ician will notice that by confining ourselves to finite sets of events and mes- 
sages, we have neglected measure-theoretical difficulties which were attacked 
by other authors (see footnote 14). 

More general studies are therefore necessary. We hope the present study 
contributes to stimulating and preparing them. 

University of California, Los Angeles , U.S.A. 
and University of Tokyo, Japan 
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